Distributional Lexical Semantics for Stop Lists

Authors

  • Mr Neil Cooke
  • Lee Gillam
Abstract

In this paper, we consider the use of techniques that lead naturally towards using distributional lexical semantics for the automatic construction of corpora-specific stop word lists. We propose and evaluate a method for calculating stop words based on collocation, frequency information and comparisons of distributions within and across samples. This method is tested against the Enron email corpus and the MuchMore Springer Bilingual Corpus of medical abstracts. We identify some of the data cleansing challenges related to the Enron corpus, and particularly how these necessarily relate to the profile of a corpus. We further consider how we can and should investigate behaviours of subsamples of such a corpus to ascertain whether the lexical semantic techniques employed might be used to identify and classify variations in contextual use of keywords that may help towards content separation in “unclean” collections: the challenge here is the separation of keywords in the same or very similar contexts, that may be conceived as a “pragmatic difference”. Such work may also be applicable to initiatives in which the focus is on constructing (clean) corpora from the web, deriving knowledge resources from wikis, and finding key information within other textual social media.
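The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of the general idea of combining frequency information with a comparison of distributions across samples. It is not the authors' method. It assumes that a stop word candidate is a word with high mean relative frequency whose frequency varies little between subsamples; the function name `stop_word_candidates`, the coefficient-of-variation threshold `max_cv`, and the toy samples are all hypothetical, and collocation information is left out.

```python
from collections import Counter
import math


def stop_word_candidates(samples, top_n=5, max_cv=0.5):
    """Return up to top_n words that are frequent and evenly spread.

    samples: list of token lists, one per subsample of the corpus.
    max_cv:  assumed threshold on the coefficient of variation
             (std / mean) of a word's relative frequency across samples.
    """
    # Relative frequency of each word within each non-empty subsample.
    rel_freqs = []
    for tokens in samples:
        if not tokens:
            continue
        counts = Counter(tokens)
        total = len(tokens)
        rel_freqs.append({w: c / total for w, c in counts.items()})

    vocab = set().union(*rel_freqs)
    scored = []
    for word in vocab:
        values = [rf.get(word, 0.0) for rf in rel_freqs]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        cv = math.sqrt(var) / mean  # mean > 0 because word is in vocab
        # Keep words whose usage is stable across subsamples.
        if cv <= max_cv:
            scored.append((mean, word))

    # Most frequent first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_n]]


if __name__ == "__main__":
    toy_samples = [
        "the patient was given the drug and the dose".split(),
        "the meeting was moved and the report is due".split(),
        "the trader sold the gas and the contract".split(),
    ]
    print(stop_word_candidates(toy_samples))
```

In this toy setting, words confined to one subsample have a high coefficient of variation and are filtered out, while words that recur at a similar rate in every subsample survive. Real corpora such as those named in the abstract would need tokenisation and cleansing first, and the threshold would have to be tuned.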

Similar resources

Automatic extraction of potential examples of semantic change using lexical sets

This paper describes ongoing work on automatically finding candidates for semantic change by comparing two corpora from different time periods. Semantic change is viewed in terms of distributional difference with a computational and linguistically motivated approach. The data is parsed, lemmatized and part of speech information is added. In distributional semantics, meaning is characterized wit...

Can distributional approaches improve on Good Old-Fashioned Lexical Semantics?

In this position paper, I discuss some linguistic problems that computational work on lexical semantics has attempted to address in the past and the implications for alternative models which incorporate distributional information. I concentrate in particular on phenomena involving count/mass distinctions, where older approaches attempted to use lexical semantics in their models of syntax. I out...

A Hybrid Distributional and Knowledge-based Model of Lexical Semantics

A range of approaches to the representation of lexical semantics have been explored within Computational Linguistics. Two of the most popular are distributional and knowledge-based models. This paper proposes hybrid models of lexical semantics that combine the advantages of these two approaches. Our models provide robust representations of synonymous words derived from WordNet. We also make use ...

Identifying Lexical Relationships and Entailments with Distributional Semantics

As the field of Natural Language Processing has developed, research has progressed on ambitious semantic tasks like Recognizing Textual Entailment (RTE). Systems that approach these tasks may perform sophisticated inference between sentences, but often depend heavily on lexical resources like WordNet to provide critical information about relationships and entailments between lexical items. Howe...

Vector spaces for historical linguistics: Using distributional semantics to study syntactic productivity in diachrony

This paper describes an application of distributional semantics to the study of syntactic productivity in diachrony, i.e., the property of grammatical constructions to attract new lexical items over time. By providing an empirical measure of semantic similarity between words derived from lexical co-occurrences, distributional semantics not only reliably captures how the verbs in the distributio...

Publication date: 2008